Polygenic Health Index, General Health, and Pleiotropy: Sibling Analysis and Disease Risk Reduction

We construct a polygenic health index as a weighted sum of polygenic risk scores for 20 major disease conditions, including coronary artery disease, type 1 and type 2 diabetes, and schizophrenia. Individual weights are determined by population-level estimates of impact on life expectancy. We validate this index in odds ratios and selection experiments using unrelated individuals and siblings (pairs and trios) from the UK Biobank. Individuals with higher index scores have decreased disease risk across almost all 20 diseases (no significant risk increases), and longer calculated life expectancy. When estimated Disability Adjusted Life Years (DALYs) are used as the performance metric, the gain from selection among ten individuals (highest index score vs average) is found to be roughly 4 DALYs. We find no statistical evidence for antagonistic trade-offs in risk reduction across these diseases. Correlations between genetic disease risks are found to be mostly positive and generally mild. These results have important implications for public health and also for fundamental issues such as pleiotropy and genetic architecture of human disease conditions.

1 Data set

Phenotype definitions
The disease definitions used various UKB-fields to define case and control status: an individual with any of the listed diagnoses or data fields filled was counted as a case for that disease. The definitions used the ICD9, ICD10 and OPCS4 codes in UKB-fields 41271, 41270, 41272; self-reported non-cancer codes from UKB-field 20002; and cancer codes in UKB-field 20001. Additionally, some diseases were specifically included in the intake questionnaire or otherwise used other UKB-fields, which are also listed below.
Most definitions did not use all possible fields, so the UKB was partly underused, i.e., for many of the diseases there are cases that incorrectly passed as controls. There might thus be some quantitative performance gains should the predictors be retrained and validated on more comprehensive phenotype definitions.
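The "any matching field counts as a case" rule described above can be sketched as follows. This is only an illustration: the dictionary layout, the function name `is_case`, and the code lists are placeholders and not the definitions used for any disease in this paper. Exact code matching is also an assumption.

```python
# Placeholder definition: maps a UKB-field to a set of codes that mark a case.
# The codes below are illustrative only and are NOT the paper's definitions.
EXAMPLE_DEFINITION = {
    "41270": {"I21", "I22"},  # ICD10 diagnosis codes (placeholder list)
    "20002": {"1075"},        # self-reported non-cancer codes (placeholder list)
}


def is_case(participant, definition):
    """Return True if any field in the definition holds any listed code."""
    for field, codes in definition.items():
        recorded = participant.get(field, ())
        if any(code in codes for code in recorded):
            return True
    return False


# Toy participants keyed by a hypothetical identifier.
participants = {
    "A": {"41270": ["I21", "E11"]},        # matching ICD10 code -> case
    "B": {"20002": ["1075"]},              # matching self-report -> case
    "C": {"41270": ["E11"], "20002": []},  # no match -> control
}
status = {pid: is_case(rec, EXAMPLE_DEFINITION) for pid, rec in participants.items()}
```

A definition that omits a field (as noted above for most diseases) simply never inspects it, which is how true cases can end up counted as controls.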

Test set demographics
The test data set consisted of 39,913 self-reported white individuals (23,110 females and 16,803 males) with a mean age of 70.4 years at data download (2021) and a standard deviation of 7.3 years. The age histograms are plotted in Figure 1; the right panel of the same figure shows the disease prevalences in the test set relative to the general white U.S. population. The UKB population is generally healthier than the general U.S. white population. Furthermore, the disease prevalence in the data set is only an approximation of the lifetime risk, as most participants may still develop any of the conditions in the future. The non-comprehensive disease definitions also undercount the number of cases in UKB. For the index construction, we therefore used literature values for the lifetime risks ρ_d. The absolute values of both the lifespan impact weights l_d and the lifetime risks ρ_d used in the index can be found in Figure 2.
Note that the frequently used metric relative risk reduction (RRR) depends on the prevalence in a selection experiment. This is shown in Figure 3 for the theoretical RRR based on a predictor with AUC 0.64. The RRR resulting from a selection experiment decreases with higher prevalence. The precise RRR values therefore depend on the absolute prevalences in the population on which an index is evaluated.
The index construction also includes the related lifetime risks ρ_d as parameters. We chose literature estimates for the general white U.S. population for these, rather than using the UKB prevalences as estimates.

Figure 3: The RRR for selecting on a single predictor can be calculated theoretically using the Gaussian risk model. The metric varies with disease prevalence, with lower RRR for more common diseases. This example used the fairly typical AUC of 0.64.
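The prevalence dependence of the theoretical RRR can be reproduced with a short numerical sketch. It assumes an equal-variance Gaussian risk model (controls with PRS distributed as N(0, 1), cases as N(d, 1), with AUC = Φ(d/√2)) and selection of the lowest PRS among `group_size` unrelated individuals. The function name, the group size, and the integration grid are illustrative choices, not taken from the paper.

```python
import numpy as np
from scipy.stats import norm


def theoretical_rrr(prevalence, auc=0.64, group_size=5):
    """RRR when the lowest-PRS individual in a random group is selected."""
    # Case/control mean separation implied by the AUC (equal unit variances).
    d = np.sqrt(2.0) * norm.ppf(auc)
    z = np.linspace(-12.0, 12.0, 48001)
    dz = z[1] - z[0]
    # Survival function of the PRS in the whole population (case/control mixture).
    pop_sf = prevalence * norm.sf(z - d) + (1.0 - prevalence) * norm.sf(z)
    # P(selected is a case) = group_size * integral of
    #   P(case with PRS = z) * P(all other group members have PRS > z).
    integrand = prevalence * norm.pdf(z - d) * group_size * pop_sf ** (group_size - 1)
    risk_selected = np.sum(integrand) * dz
    return 1.0 - risk_selected / prevalence


# RRR at a few example prevalences; it shrinks as the disease becomes more common.
rrr_by_prevalence = {rho: theoretical_rrr(rho) for rho in (0.01, 0.1, 0.3)}
```

For a group of two the integral has a closed form, 1 - 2[ρ/2 + (1 - ρ)(1 - AUC)], which provides a quick check of the numerical integration.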

Individual Predictor Construction
Most of the predictors used in this paper were trained with the LASSO algorithm on the UK Biobank, using the methods described in Lello et al. [67,68]. Predictors for several other disease conditions were trained using the PRS-CS package [69,70] and the EUR 1000 Genomes reference panel coupled with a publicly available GWAS. For these traits, we selected GWAS that specifically excluded UK Biobank participants, to prevent inflated performance metrics. The GWAS were pruned by filtering down to markers that are present in the UK Biobank calls before running PRS-CS with the 1000 Genomes EUR LD panels. In addition to LASSO and PRS-CS, we used a publicly available schizophrenia predictor, which was filtered to markers that overlap the UK Biobank imputed set and to p-value < 0.05, resulting in 24,387 markers. We now list the construction methods and data sources for each predictor along with the AUC on the testing set described in 1.2.

AUC evaluation
The uncertainties in AUC for each predictor in Table 1 in the main text were computed via the following algorithm. Case/control numbers and mean PRS were computed in the test set and a theoretical PRS-distribution was defined, according to equation (2) in the main text. The same numbers of cases and controls that were in the test set were sampled from the case and control parts of the PRS-distribution, respectively. An AUC was computed based on the sampled PRS and the procedure was repeated 30 times. The standard deviation from these repeated computations is the error reported next to the AUC.
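The resampling procedure above can be sketched as follows. Equation (2) of the main text is not reproduced here, so the sketch assumes that the theoretical PRS distribution consists of two Gaussians (cases and controls) with a common standard deviation; the function names and example numbers are illustrative.

```python
import numpy as np
from scipy.stats import rankdata


def auc_from_scores(case_scores, control_scores):
    """AUC via the rank-sum (Mann-Whitney) identity."""
    n_case, n_ctrl = len(case_scores), len(control_scores)
    ranks = rankdata(np.concatenate([case_scores, control_scores]))
    u_stat = ranks[:n_case].sum() - n_case * (n_case + 1) / 2.0
    return u_stat / (n_case * n_ctrl)


def auc_uncertainty(n_case, n_ctrl, mean_case, mean_ctrl, sd=1.0, n_rep=30, seed=0):
    """Mean and standard deviation of the AUC over repeated parametric resampling."""
    rng = np.random.default_rng(seed)
    aucs = []
    for _ in range(n_rep):
        # Draw as many cases and controls as were observed in the test set.
        cases = rng.normal(mean_case, sd, n_case)
        controls = rng.normal(mean_ctrl, sd, n_ctrl)
        aucs.append(auc_from_scores(cases, controls))
    return float(np.mean(aucs)), float(np.std(aucs, ddof=1))


# Example with made-up case/control counts and PRS means.
mean_auc, auc_error = auc_uncertainty(n_case=500, n_ctrl=5000, mean_case=0.5, mean_ctrl=0.0)
```

The reported error shrinks as the number of cases grows, which is why diseases with few cases carry visibly larger AUC uncertainties.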

Additional Diseases and Phenotypes
We examined the relation of the health index to 11 additional diseases that were not part of the index itself and to 5 addiction categories, and computed its correlation with 5 continuous traits.
The studied diseases and their abbreviations are listed in Table 1. We tested the relationship between the health index and the binary phenotypes through t-tests, as computed by the Python function scipy.stats.ttest_ind, and present the result in Figure 4 in the form of box plots comparing the index distributions for controls and cases. Since there are systematic differences between the sexes for the index values (see section 4.2 for a sex neutral correction of this bias), we computed the t-tests and box plots for females and males separately. In each sex, only two diseases exhibited a statistically significant (p < .05) difference between the health index mean for cases and controls. Females with bipolar disorder (with borderline significance, p = 0.047) or with COPD (p < .001) have on average a lower health index than the controls for these diseases. In male subjects, COPD remained significant whereas bipolar disorder was consistent with no difference. Instead, rheumatoid arthritis was statistically significant among males (p < .001). We do not have an immediate explanation for why the mean difference is much more significant among males than among females (p = .059). In all these cases (and the just-out-of-significance CRC and RA for females), the health index is on average higher for controls than for individuals with the disease.
We used the UKB online follow-up survey on addiction to examine any systematic relationships to the health index. We used UKB-fields 20401, 20406, 20431, 20456, 20503 of self-reported answers to questions of the form "Ever addicted to ?", and listed "no" as control and "yes" as case. Much less data is available here, and the overlap between answering participants and our test set was small. Consequently, the results, presented in Figure 5, have weak statistical power. All but one t-test showed mean differences in health index between cases and controls that are consistent with zero, the exception being alcohol addiction among males for which the average health index was higher with weak statistical significance. A Bonferroni correction (either for 10 multiple tests or even for 2 male/female tests) would eliminate all statistical significance for the addiction t-tests.
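The sex-stratified case/control comparison and the Bonferroni check can be sketched as below. The sketch assumes the standard two-sample test `scipy.stats.ttest_ind` with its default settings; the helper names and the synthetic data are illustrative and do not reproduce the UKB analysis.

```python
import numpy as np
from scipy import stats


def index_ttests_by_sex(index, is_case, is_female):
    """Two-sample t-test of the health index (cases vs controls) within each sex."""
    results = {}
    for label, sex_mask in (("female", is_female), ("male", ~is_female)):
        cases = index[sex_mask & is_case]
        controls = index[sex_mask & ~is_case]
        t_stat, p_value = stats.ttest_ind(cases, controls)
        results[label] = (float(t_stat), float(p_value))
    return results


def bonferroni_significant(p_values, alpha=0.05):
    """Flag p-values that survive a Bonferroni correction for len(p_values) tests."""
    p_values = np.asarray(p_values, dtype=float)
    return p_values < alpha / p_values.size


# Synthetic example: cases have a health index lowered by half a standard deviation.
rng = np.random.default_rng(0)
n = 4000
is_female = rng.random(n) < 0.58
is_case = rng.random(n) < 0.10
index = rng.normal(size=n) - 0.5 * is_case
ttests = index_ttests_by_sex(index, is_case, is_female)

# A borderline p-value no longer passes once two tests are accounted for.
survives = bonferroni_significant([0.047, 0.0005])
```

A negative t-statistic here means that cases have a lower mean index than controls, matching the direction reported for the significant diseases.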
The five continuous phenotypes were lung capacity (forced expiratory volume (FEV) and forced vital capacity (FVC)), fluid intelligence, grip strength, and height. The exact UKB definitions and covariate corrections are listed at the end of this section. All correlations are listed in Table 2 together with sample sizes and zero-slope p-values from linear regressions. All correlations were weak, height being the strongest at .06, while all linear regressions had statistically significant non-zero slopes. Correlations below 0.1 are weak when compared to classic theoretical bounds [81] or modern empirical bounds based on replicable findings [82,83].
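A weak correlation can still have a highly significant non-zero slope when the sample is large, as the following sketch with synthetic data illustrates. It uses `scipy.stats.linregress`, whose `pvalue` tests the null hypothesis of zero slope; the sample size and effect size are made-up values chosen only to mimic the situation described above.

```python
import numpy as np
from scipy.stats import linregress

# Synthetic data: a trait with a true correlation of about 0.06 with the index.
rng = np.random.default_rng(1)
n = 40_000
health_index = rng.normal(size=n)
trait = 0.06 * health_index + rng.normal(size=n)

fit = linregress(health_index, trait)
correlation = fit.rvalue    # Pearson correlation coefficient
p_zero_slope = fit.pvalue   # p-value for the null hypothesis of zero slope
```

With tens of thousands of samples, the standard error of the correlation is below 0.01, so even a correlation of this size is clearly distinguishable from zero while explaining well under 1% of the variance.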
As a final check on the relation between the additional phenotypes and the health index, we fitted a linear regression with the case/control status and continuous variables as (L2-normalized) features and the health index as the predicted variable. For each sex, we trained and evaluated the model 10 times, each time randomly setting aside 5% of the data as a test set. For females, the R² was .003 (std .009); for males, .005 (std .012), with training/test sizes 9,599/505 and 6,488/341, respectively. We concluded that the additional diseases and traits were not predictive of the health index, and hence generally (linearly) independent of it.
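The repeated hold-out evaluation can be sketched as follows. Two points are assumptions: "L2-normalized" is taken to mean that each feature column is scaled to unit L2 norm, and R² is computed on the held-out 5% relative to the held-out mean. The function name and the synthetic data are illustrative.

```python
import numpy as np


def repeated_holdout_r2(features, target, n_rep=10, test_frac=0.05, seed=0):
    """Mean and std of out-of-sample R^2 over repeated random train/test splits."""
    rng = np.random.default_rng(seed)
    # Assumption: scale every feature column to unit L2 norm.
    norms = np.linalg.norm(features, axis=0)
    x = features / np.where(norms > 0, norms, 1.0)
    x = np.column_stack([np.ones(len(x)), x])  # add an intercept column
    n_test = int(round(test_frac * len(x)))
    scores = []
    for _ in range(n_rep):
        perm = rng.permutation(len(x))
        test, train = perm[:n_test], perm[n_test:]
        coef, *_ = np.linalg.lstsq(x[train], target[train], rcond=None)
        resid = target[test] - x[test] @ coef
        ss_tot = np.sum((target[test] - target[test].mean()) ** 2)
        scores.append(1.0 - np.sum(resid ** 2) / ss_tot)
    return float(np.mean(scores)), float(np.std(scores))


# Synthetic check: features unrelated to the target give R^2 near zero.
rng = np.random.default_rng(2)
x_demo = rng.normal(size=(5000, 8))
r2_null, r2_null_std = repeated_holdout_r2(x_demo, rng.normal(size=5000))
# Contrast: a target that does depend on a feature gives a clearly positive R^2.
r2_signal, _ = repeated_holdout_r2(x_demo, 2.0 * x_demo[:, 0] + rng.normal(size=5000))
```

An out-of-sample R² within about one standard deviation of zero, as reported above for both sexes, is what this procedure returns when the features carry no linear information about the target.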

Continuous phenotypes
Lung capacity variables were z-scored, accounting for age, sex and height, as provided by UKB-fields 20256 FEV1 and 20257 FVC.
Fluid intelligence used UKB-field 20016 with the following processing: (1) mean center all instances, (2) fit and subtract a quadratic polynomial for the age dependence of all scores using the corresponding age at each instance, (3) z-score each instance, (4) for each sample, take the mean across all instances, and (5) z-score again.
Grip strength used the mean of UKB-fields 46 and 47 instance 0 for each sample. The predicted value by linear regression on age (at instance 0) was then subtracted from all values.
Height used UKB-field 50 for samples listed as genetically British, with the following processing: (1) z-score males and females separately, (2) subtract linear regression on year of birth.
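As an illustration of these covariate corrections, the five fluid intelligence steps listed above can be sketched as below. The array layout (one row per sample, one column per instance, NaN for a missing instance), the reading of "mean center all instances" as per-instance centering, and the function names are assumptions.

```python
import numpy as np


def zscore(a, axis=None):
    """Z-score while ignoring NaN entries."""
    mean = np.nanmean(a, axis=axis, keepdims=True)
    std = np.nanstd(a, axis=axis, keepdims=True)
    return (a - mean) / std


def process_fluid_intelligence(scores, ages):
    """scores, ages: shape (n_samples, n_instances); NaN marks a missing instance."""
    scores = np.asarray(scores, dtype=float)
    ages = np.asarray(ages, dtype=float)
    # (1) mean center each instance (assumed to mean per-column centering)
    centered = scores - np.nanmean(scores, axis=0, keepdims=True)
    # (2) fit one quadratic in age to all observed scores and subtract it
    observed = ~np.isnan(centered) & ~np.isnan(ages)
    coeffs = np.polyfit(ages[observed], centered[observed], deg=2)
    residual = centered - np.polyval(coeffs, ages)
    # (3) z-score each instance
    per_instance = zscore(residual, axis=0)
    # (4) average the available instances for each sample
    per_sample = np.nanmean(per_instance, axis=1)
    # (5) z-score the per-sample means
    return zscore(per_sample)


# Synthetic example with an age trend and some missing repeat instances.
rng = np.random.default_rng(4)
n = 2000
demo_ages = rng.uniform(40, 70, size=(n, 3))
demo_scores = 6.0 - 0.002 * (demo_ages - 55.0) ** 2 + rng.normal(size=(n, 3))
demo_scores[rng.random(n) < 0.7, 1] = np.nan
demo_scores[rng.random(n) < 0.9, 2] = np.nan
fluid_z = process_fluid_intelligence(demo_scores, demo_ages)
```

The grip strength and height corrections follow the same pattern, with a linear rather than quadratic fit on the respective covariate.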

Selection experiments in genetic trios
The full RRR and index gain plots for the index selection among genetic trios are shown in Figure 6. Note that the error bars are very large and most disease RRR and index gains are inconclusive in this experiment. We also display a comparison of total index gain in DALY for pairs and trios of both siblings and unrelated individuals in Figure 7.

Sex bias adjusted health index
The index is defined with sex specific parameters l_d and ρ_d and includes different diseases for males (PC, TC) and females (BC). Consequently, the health index distributions are somewhat different for the two sexes. The effect is small but existent, as can be seen in Figure 8. The selection experiments are sensitive to this: the larger the group size, the stronger the dependence on the right tails, i.e., on the distribution differences for the highest health index values. As can be seen to the left in the figure, there is a larger proportion of females than males in the test set with very high health index as compared to the intermediate or lower index value regions. This is a result of the particular choice of index and test set but comports well in both direction and scale with general life expectancy differences. As a result, however, direct selection on the health index leads to an over-representation of women in the selected set. We defined a minimal non-linear transformation of the male and female health index values, mapping them to their mean distribution, to obtain a sex neutral health index. The result is plotted on the negative y-axis in Figure 8, with the resulting QQ-plot to the right. Selecting on the sex-adjusted index kept the females-to-males ratio among the selected equal to the total test set ratio. This had minor measurable effects on the index performance and the results for group size five are shown in Figure 9.

Figure 8: Left: There were more females (23,110) in the test set than males (16,803). The adjustment is minor but with noticeable effect on the tails; the corresponding densities (normalizing by total number of females/males) are practically identical after the sex adjustment. Right: A QQ-plot of the female and male health index distributions before and after sex-adjustment. The plotted dots correspond to percentiles but with extra focus on the tails; the 0-3 percentiles and 97-100 percentiles are split into 40 equidistant points each such that the tail behaviors are shown clearly. The sex-adjusted distributions agree almost exactly, with a regression R² of 0.99 (affected only by the extreme outlier at 0.075th percentile). As such, a sex adjusted health index could be used to compare the health of males to females without preference to either, as both are measured relative to their respective cohorts.
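The exact transformation is not spelled out in this section. One way to map both sexes to a common distribution is quantile mapping onto the average of the female and male quantile functions, sketched below. This is an assumed construction for illustration, not necessarily the transformation used for Figure 8; the function name and grid resolution are arbitrary.

```python
import numpy as np


def sex_neutral_index(index, is_female, n_grid=1001):
    """Map each sex's index values onto the mean of the two quantile functions."""
    index = np.asarray(index, dtype=float)
    grid = np.linspace(0.0, 1.0, n_grid)
    q_female = np.quantile(index[is_female], grid)
    q_male = np.quantile(index[~is_female], grid)
    target = 0.5 * (q_female + q_male)  # assumed "mean distribution"
    adjusted = np.empty_like(index)
    for mask, q_own in ((is_female, q_female), (~is_female, q_male)):
        # Quantile level of each value within its own sex ...
        levels = np.interp(index[mask], q_own, grid)
        # ... mapped to the value at the same level in the target distribution.
        adjusted[mask] = np.interp(levels, grid, target)
    return adjusted


# Synthetic example: the two sexes differ slightly in location and spread.
rng = np.random.default_rng(3)
demo_is_female = np.arange(5000) < 3000
demo_index = np.where(demo_is_female,
                      rng.normal(0.2, 1.0, 5000),
                      rng.normal(0.0, 1.1, 5000))
demo_adjusted = sex_neutral_index(demo_index, demo_is_female)
```

Because each value keeps its rank within its own sex, selecting on the adjusted index picks females and males in proportion to their numbers in the data, which is the behavior described above.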

Non-European ancestry
The main part of this paper dealt exclusively with a data set of European ancestry. All predictors were trained on such a cohort and it is a well-established fact that predictor performance declines with the genetic distance between two populations (typically linearly when measured in R², see for example [84]). Nevertheless, some of the performance of the EUR-trained predictors is retained when applied to other ancestries and we demonstrate here that even a composite health index has non-trivial performance for people of South Asian (SAS), East Asian (EAS), and African (AFR) ancestry. Based on self-reported ancestry in UKB, we created test sets with 9,438 (SAS), 1,493 (EAS), and 7,614 (AFR) samples, withheld from all training and hyperparameter tuning. We used the same type of index construction but excluded basal cell carcinoma and malignant melanoma, because these diseases are close to non-existent in these test sets, and major depressive disorder, because of its poor individual predictor performance.
For each test set, we used ancestry specific weights and population risks l_d, ρ_d. The individual disease RRR and index component gains for SAS are shown in Figure 10 for selection among groups of size 5, while the total index gains are shown for EAS and AFR in Figure 11. The RRR result is overwhelmingly positive also for South Asian ancestry, again reaching over or about 40% for a couple of traits (AD, HA). Notably, the Alzheimer's disease risk was reduced more for SAS than EUR in these experiments even taking the large error bars into account. This is based on only 34 SAS AD cases, however. The case numbers are always included above the x-axis in the plots for this reason. Another observation is the differences in type II diabetes. Although still with a strong relative risk reduction of 18%, the SAS result is about half the RRR of the EUR index. The SAS RRR for IBD is also worse and appears to have a borderline statistically significant negative mean value. As seen in Figure 12 below, the IBD predictor trained in EUR has more negative pairwise correlations with other disease PRS when applied to SAS. In particular BC, SCZ, T1D and T2D, with their much stronger index weights, may counter the predicted IBD risk for SAS.

Figure 10: Left: The RRR from the index selection is overwhelmingly positive also in the SAS test set. The borderline statistically significant negative RRR for IBD is the most notable difference from the EUR result and can be traced to IBD's more negative PRS-correlations with other predictors in SAS as compared to EUR (see Figure 12). Note again that the PRS is computed by the EUR trained predictor applied to SAS and may reflect population differences in linkage disequilibrium rather than underlying biology in SAS. Right: The component-wise index gain is also predominantly positive. The possibly negative RRR for IBD has almost no impact at all on the index due to its small weight and low prevalence.
To be clear, the PRS correlations in the SAS data sets still refer to the predictors trained in the EUR data set and the differences may be due to distinct linkage disequilibrium patterns in general; it is still unknown what the PRS correlations would be using a training set of SAS ancestry. Lastly, we note that MDD still has a significant positive RRR despite the fact that there was no direct MDD PRS included in the South Asian health index. The SAS index gains for the components are overwhelmingly positive and dominated by CAD, heart attack, hypertension, major depressive disorder, obesity, and type II diabetes. Again, we note that MDD contributes a lot, due to its high prevalence and strong impact, despite not being in the index directly. There were no statistically significant negative contributions to the SAS index.
The index gain from selection among EAS and AFR also performs well, as seen in Figure 11. We detect a measurable attenuation from the EUR result, as is expected due to the genetic distance from the European training population. Yet, there is a consistent and strongly significant positive gain for both EAS and AFR when using the EUR-trained predictors and ancestry specific parameters in the index construction.
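A selection experiment of this kind can be sketched as follows, assuming that individuals are randomly partitioned into groups of fixed size, the group member with the highest index is selected, and the RRR compares disease prevalence among the selected to the average prevalence. The function name, the logistic toy risk model, and the synthetic data are illustrative only.

```python
import numpy as np


def selection_rrr(index, is_case, group_size=5, seed=0):
    """RRR from selecting the highest-index individual in random groups."""
    rng = np.random.default_rng(seed)
    n_groups = len(index) // group_size
    # Randomly partition individuals into groups (leftover individuals are dropped).
    members = rng.permutation(len(index))[: n_groups * group_size]
    members = members.reshape(n_groups, group_size)
    # Select the member with the highest health index in each group.
    best = members[np.arange(n_groups), np.argmax(index[members], axis=1)]
    risk_selected = is_case[best].mean()
    risk_average = is_case[members].mean()
    return 1.0 - risk_selected / risk_average


# Synthetic example: disease probability falls with increasing index (toy model).
rng = np.random.default_rng(5)
n = 50_000
toy_index = rng.normal(size=n)
toy_risk = 1.0 / (1.0 + np.exp(2.0 + 0.7 * toy_index))
toy_case = rng.random(n) < toy_risk
toy_rrr = selection_rrr(toy_index, toy_case, group_size=5)
```

With few cases, as for AD in the SAS test set, the prevalence among the selected rests on a handful of individuals, which is the source of the large error bars noted above.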
The phenotypic and genetic correlation characterization of the diseases and predictors in the South Asian test set is shown in Figure 12; see Figure 8 in the paper on how to interpret this figure. The qualitative observations for the EUR ancestry hold for the SAS test set too: the studied diseases overwhelmingly tend to have positive comorbidity with one or more of the other diseases, and the PRS are mostly uncorrelated or mildly positively correlated. There are a few more weakly anti-correlated pairs for SAS than for EUR, in particular for schizophrenia and IBD. The latter provides an explanation for why IBD has a worse RRR for SAS than for EUR. This may be an artifact of using EUR-trained predictors on SAS ancestry and does not need to reflect the underlying genetic effects and biology. An index built from SAS-trained predictors could answer such questions as soon as sufficient data is available.